Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

ends of the reads, and to remove adaptors and duplicates. Refer to Chapter 1 for detailed

information about this step. For multiplexed data, you need to perform demultiplexing

before you do the quality control. Multiplexing and demultiplexing are discussed in

Chapter 7. The FASTQ files that we downloaded have already been processed and

contain reads of good quality. You can check their quality with FastQC as follows:

# Run FastQC on every FASTQ file; the reports are written next to the inputs.

fastqc fastqdir/*.fastq

# Open the HTML reports, one browser tab per file.

firefox fastqdir/*.html

The above commands will display the quality control reports of the six FASTQ files in the

Firefox browser. Go through the six tabs to study the reports.
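
Before opening the browser, it can help to confirm that a report exists for every input file. The following is a minimal sketch, assuming FastQC's default behavior of writing a report named <basename>_fastqc.html next to each input file; the helper name missing_reports is hypothetical and is not part of FastQC.

```shell
# Hypothetical helper: print each FASTQ file in a directory that has
# no non-empty FastQC HTML report next to it.
missing_reports() {
    for fq in "$1"/*.fastq; do
        [ -e "$fq" ] || continue            # the glob matched nothing
        report="${fq%.fastq}_fastqc.html"   # FastQC's default report name
        [ -s "$report" ] || echo "$fq"
    done
}

# Lists the FASTQ files that still need a report; prints nothing if all are done.
missing_reports fastqdir
```

If the helper prints nothing, every FASTQ file in the directory has a report.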

8.2.3 Removing Host DNA Reads

Metagenomic data recovered from clinical samples are usually mixed with host genomic

DNA sequences. These sequences, which represent an untargeted fraction of the data, must be

filtered out before the subsequent steps of the analysis. Any other untargeted sequences can

also be removed after the host sequences have been removed. The process of removing

the host sequences begins by aligning the raw reads to the reference genome of the host. The

host reads will map to the reference genome, whereas the metagenomic reads will not.

Thus, after the mapping process, we can extract the unmapped reads and store

them in separate FASTQ files. For paired-end reads, we will have two FASTQ files

representing the raw metagenomic data without the host sequences.
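
The map-then-extract idea described above can be sketched in a few commands. This is only one common way to do it, assuming Bowtie2 and Samtools are installed; it is not necessarily the exact sequence of commands used in the steps that follow. The function name remove_host_reads, the thread count, and all file names are hypothetical.

```shell
# Hypothetical helper: keep only the read pairs in which neither mate
# aligned to the host genome.
# Arguments: Bowtie2 index prefix, forward FASTQ, reverse FASTQ, output prefix.
remove_host_reads() {
    index=$1; r1=$2; r2=$3; out=$4

    # Align the pairs to the host genome; mates stay adjacent in the output.
    bowtie2 -p 4 -x "$index" -1 "$r1" -2 "$r2" -S "$out.sam" || return 1

    # -f 12 = read unmapped (4) + mate unmapped (8);
    # -F 256 drops secondary alignments.
    samtools view -b -f 12 -F 256 -o "$out.unmapped.bam" "$out.sam" || return 1

    # Write the surviving pairs back to two FASTQ files.
    samtools fastq -n -1 "${out}_1.fastq" -2 "${out}_2.fastq" \
        -0 /dev/null -s /dev/null "$out.unmapped.bam"
}

# Example call (file names are placeholders):
# remove_host_reads ref/hg19 fastqdir/sample_1.fastq fastqdir/sample_2.fastq sample_nohost
```

Requiring both mates to be unmapped is the stricter choice: a pair is discarded as soon as either mate aligns to the host.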

Since the host of our data is human, we will align the reads to the human reference genome.

We have already discussed read mapping in Chapter 2 and in other chapters. This

time we will use the Bowtie2 aligner, so we will walk you through the steps without repeating

the discussion. The following are the steps to remove the human host sequences from the

metagenomic data.

8.2.3.1 Download Human Reference Genome

You did this step before in Chapter 6. If you still have the human reference genome and its

Bowtie2 index saved on your drive, you can reuse them, since building the Bowtie2

index may take some time. If you do not have those files stored on your computer, run the

following commands in your project working directory to download the FASTA sequence of

the human reference genome, decompress it, and index it with both Samtools and Bowtie2:

mkdir ref; cd ref

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

gunzip -d hg19.fa.gz

samtools faidx hg19.fa

bowtie2-build hg19.fa hg19

cd ..
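
To decide whether the download and indexing can be skipped, you can test whether the index files are already in place. The sketch below assumes the standard outputs of the two indexing commands: a .fai file from samtools faidx and six index files from bowtie2-build, with the extension .bt2, or .bt2l when a large index is built. The helper name have_reference is hypothetical.

```shell
# Hypothetical helper: succeed only if the FASTA index and all six
# Bowtie2 index files exist and are non-empty.
# Arguments: directory holding the reference, index prefix.
have_reference() {
    dir=$1
    prefix=$2
    [ -s "$dir/$prefix.fa.fai" ] || return 1
    for part in 1 2 3 4 rev.1 rev.2; do
        # Accept either a small (.bt2) or a large (.bt2l) index file.
        [ -s "$dir/$prefix.$part.bt2" ] || [ -s "$dir/$prefix.$part.bt2l" ] || return 1
    done
    return 0
}

if have_reference ref hg19; then
    echo "reference and indexes found; nothing to rebuild"
else
    echo "reference or indexes missing; run the download and indexing commands"
fi
```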